Welcome to Intern Insight, Data Tech College’s Dedicated Student Internship Admissions Portal!
The goal:
Interact with the data for an enhanced user experience.
Provide effective insights into the internship activity of Data Tech College students each year.
Foster data-driven action to help advance the careers of the students.
About the Data
The dataset contains summer internship results for 80 students who attend Data Tech College. It includes academic attributes such as each student’s test score, GPA, and writing score; holistic features such as volunteer and work experience; and demographic information such as state and gender.
The original internship admissions data set contained outliers such as erroneous GPA and demographic values, which were subsequently removed during data preprocessing. All visualizations and results presented below are based on the cleaned data set, which excludes the outlier rows.
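The preprocessing step described above could look roughly like the sketch below. The validity rules are assumptions for illustration (a 0 to 4 GPA scale and two gender categories); the source does not state the actual rules, and the rows here are invented.

```python
import pandas as pd

# Hypothetical raw records; column names mirror those used later in this report.
raw = pd.DataFrame({
    "GPA": [3.6, 3.9, 39.0, 2.8, -1.0],
    "Gender": ["Female", "Male", "Male", "Unknown??", "Female"],
    "Decision": ["Admit", "Waitlist", "Decline", "Decline", "Admit"],
})

# Assumed validity rules (not stated in the source):
# GPA on a 0-4 scale, gender limited to the two categories analyzed below.
valid_gpa = raw["GPA"].between(0.0, 4.0)
valid_gender = raw["Gender"].isin(["Female", "Male"])

# Keep only rows that pass every check.
clean = raw[valid_gpa & valid_gender].reset_index(drop=True)
print(f"Kept {len(clean)} of {len(raw)} rows")
```

Filtering with explicit boolean masks keeps each rule visible, which makes it easier to report how many rows each rule removed.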
At A Glance - Summer 2023
Mean Trends by Metric
Below is an interactive bar plot displaying the mean of each numerical metric (GPA, Test Score, etc.) by internship admissions decision. To switch between metrics, open the dropdown menu and select your metric of choice.
Code
# Import packages
import pandas as pd
import numpy as np
import altair as alt
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio
from mpl_toolkits.mplot3d import Axes3D
from vega_datasets import data

# Read in clean data
df = pd.read_csv("../data/clean_data.csv")

# Get mean and median dfs
means = df[["Decision", "GPA", "WorkExp", "TestScore", "WritingScore", "VolunteerLevel"]].groupby("Decision").agg(["mean"]).reset_index()
means.columns = means.columns.droplevel(1)
means.columns = ["Decision", "GPA", "Work Experience", "Test Score", "Writing Score", "Volunteer Level"]
medians = df[["Decision", "GPA", "WorkExp", "TestScore", "WritingScore", "VolunteerLevel"]].groupby("Decision").agg(["median"]).reset_index()
medians.columns = medians.columns.droplevel(1)
medians.columns = ["Decision", "GPA", "Work Experience", "Test Score", "Writing Score", "Volunteer Level"]

# Plotly visual
variables = ["GPA", "Test Score", "Writing Score", "Work Experience", "Volunteer Level"]
order = ["Admit", "Waitlist", "Decline"]
pio.renderers.default = "plotly_mimetype+notebook"

# Add traces (only the GPA trace is visible at first)
plot = go.Figure(data=[
    go.Bar(name="GPA", x=means["Decision"], y=means["GPA"], marker_color="#0E6BA8"),
    go.Bar(name="Test Score", x=means["Decision"], y=means["Test Score"], marker_color="#6F0624", visible=False),
    go.Bar(name="Writing Score", x=means["Decision"], y=means["Writing Score"], marker_color="#8B748F", visible=False),
    go.Bar(name="Work Experience", x=means["Decision"], y=means["Work Experience"], marker_color="#00072D", visible=False),
    go.Bar(name="Volunteer Level", x=means["Decision"], y=means["Volunteer Level"], marker_color="#0A2472", visible=False),
])

# Set the initial view to JUST be GPA
initial_view = {"visible": [True, False, False, False, False]}

# List of titles to use
titles = ["Mean GPA", "Mean Test Score", "Mean Writing Score", "Mean Years of Work Experience", "Mean Volunteer Level"]

# Dropdown: each button shows one trace and updates the titles
plot.update_layout(
    updatemenus=[
        dict(
            active=0,
            x=-0.1,
            y=0.7,
            buttons=list([
                dict(
                    label=variable,
                    method="update",
                    args=[
                        {"visible": [i == j for i in range(len(variables))]},
                        {
                            "title": f"{titles[j]} by Admissions Decision",
                            "xaxis_title": "Admissions Decision",
                            "yaxis_title": titles[j],
                            "xaxis": {"categoryorder": "array", "categoryarray": order},
                        },
                    ],
                )
                for j, variable in enumerate(variables)
            ]),
        )
    ],
    title_text=f"{titles[0]} by Admissions Decision",
    xaxis=dict(categoryorder="array", categoryarray=order),
    showlegend=True,
    margin=dict(l=50, r=50, t=50, b=50),
)
plot.show()
Median Trends by Metric
Below is an interactive bar plot displaying the median of each numerical metric (GPA, Test Score, etc.) by internship admissions decision. To switch between metrics, open the dropdown menu and select your metric of choice.
Code
pio.renderers.default = "plotly_mimetype+notebook"

# Add traces (only the GPA trace is visible at first)
plot = go.Figure(data=[
    go.Bar(name="GPA", x=medians["Decision"], y=medians["GPA"], marker_color="#0E6BA8"),
    go.Bar(name="Test Score", x=medians["Decision"], y=medians["Test Score"], marker_color="#6F0624", visible=False),
    go.Bar(name="Writing Score", x=medians["Decision"], y=medians["Writing Score"], marker_color="#8B748F", visible=False),
    go.Bar(name="Work Experience", x=medians["Decision"], y=medians["Work Experience"], marker_color="#00072D", visible=False),
    go.Bar(name="Volunteer Level", x=medians["Decision"], y=medians["Volunteer Level"], marker_color="#0A2472", visible=False),
])

# Set the initial view to JUST be GPA
initial_view = {"visible": [True, False, False, False, False]}

# List of titles to use
titles = ["Median GPA", "Median Test Score", "Median Writing Score", "Median Years of Work Experience", "Median Volunteer Level"]

# Dropdown: each button shows one trace and updates the titles
plot.update_layout(
    updatemenus=[
        dict(
            active=0,
            x=-0.1,
            y=0.7,
            buttons=list([
                dict(
                    label=variable,
                    method="update",
                    args=[
                        {"visible": [i == j for i in range(len(variables))]},
                        {
                            "title": f"{titles[j]} by Admissions Decision",
                            "xaxis_title": "Admissions Decision",
                            "yaxis_title": titles[j],
                            "xaxis": {"categoryorder": "array", "categoryarray": order},
                        },
                    ],
                )
                for j, variable in enumerate(variables)
            ]),
        )
    ],
    title_text=f"{titles[0]} by Admissions Decision",
    xaxis=dict(categoryorder="array", categoryarray=order),
    showlegend=True,
    margin=dict(l=50, r=50, t=50, b=50),
)
plot.show()
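The mean and median tables behind the two plots differ only in the statistic applied. As a minimal sketch, the aggregation could be wrapped in one helper; the function name `summarize` and the toy rows (including the score scale) are hypothetical, while the column names follow the code above.

```python
import pandas as pd

# Hypothetical toy rows; the report itself reads ../data/clean_data.csv.
toy = pd.DataFrame({
    "Decision": ["Admit", "Admit", "Decline", "Decline", "Waitlist"],
    "GPA": [3.8, 3.6, 3.0, 3.4, 3.5],
    "TestScore": [950, 930, 800, 840, 880],
})

def summarize(frame: pd.DataFrame, stat: str) -> pd.DataFrame:
    """Return one row per decision with the chosen statistic of each metric."""
    metrics = ["GPA", "TestScore"]
    return frame.groupby("Decision")[metrics].agg(stat).reset_index()

means_toy = summarize(toy, "mean")
medians_toy = summarize(toy, "median")
print(means_toy)
print(medians_toy)
```

Passing the statistic as a string gives single-level column names, so no `droplevel` step is needed afterwards.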
Demographic Insights
The following section breaks down the relationship between demographics (gender & state) and internship application decisions.
Below is the number of students per state and decision. Note that most state and decision combinations contain only a handful of students, so the state-level analysis that follows cannot be treated as representative of the entire population.
Code
import pandas as pd
import altair as alt
import seaborn as sns
import plotly.express as px
import plotly.io as pio
import matplotlib.pyplot as plt
from vega_datasets import data

df = pd.read_csv('../data/clean_data.csv')

# Count students for each decision and state combination
decision_count = df.groupby(['Decision', 'State']).size().reset_index()
decision_count = decision_count.rename(columns={0: 'Count'})
| Decision | State | Count |
| --- | --- | --- |
| Admit | California | 9 |
| Admit | Colorado | 8 |
| Admit | Florida | 11 |
| Admit | Utah | 1 |
| Decline | California | 1 |
| Decline | Colorado | 6 |
| Decline | Florida | 13 |
| Decline | Mississippi | 1 |
| Decline | Oregon | 1 |
| Decline | Utah | 2 |
| Decline | Virginia | 4 |
| Waitlist | Alabama | 1 |
| Waitlist | California | 2 |
| Waitlist | Colorado | 4 |
| Waitlist | Florida | 11 |
| Waitlist | New York | 1 |
| Waitlist | Utah | 3 |
| Waitlist | Vermont | 1 |
Geographics
To provide an overview of the data, we will be looking at the data from a geographic perspective, specifically at the state level.
Above is a choropleth map of the average numeric feature (GPA, test score, writing score, work experience in years, and volunteer level) by state. The average of the numeric features is calculated across all decision types to obtain a holistic view of the student data by state. Below we will summarize some findings for each feature:
GPA
California has the highest average GPA, with Florida and New York close behind.
Oregon and Mississippi have the lowest average GPA.
Test Score
California has the highest average test score.
Mississippi has the lowest average test score.
Writing Score
California has the highest average writing score.
New York has the lowest average writing score.
Work Experience
Mississippi has the highest average work experience in years.
Oregon has the lowest average work experience.
Volunteer Level
Oregon has the highest average volunteer level.
Alabama has the lowest average volunteer level.
We can also look at some of these features at the geographic level by decision.
As we can see from the average GPA and test scores for admitted and declined students by state, students who were admitted had higher GPAs and test scores than those who were declined.
This insight suggests that helping students improve their test scores could increase their chances of being admitted to an internship.
Decision Rates by State
We can also look at the rates of students admitted to and declined from internships by state to see how successful students from each state are overall.
Code
# Create dataframe of rates for each state by decision
decision_state = df.groupby(['Decision', 'State'])[["GPA"]].count().reset_index()
decision_state = decision_state.rename(columns={'GPA': 'StateCount'})
decision_state['DecisionCount'] = decision_state.groupby('Decision')['StateCount'].transform('sum')
decision_state['Rate'] = decision_state['StateCount'] / decision_state['DecisionCount'] * 100

# Map state names to the numeric IDs used by the map
state_id_dict = dict(zip(data.population_engineers_hurricanes()["state"], data.population_engineers_hurricanes()["id"]))
decision_state["StateID"] = decision_state["State"].map(state_id_dict)

admit_states = decision_state[decision_state['Decision'] == "Admit"]
decline_states = decision_state[decision_state['Decision'] == "Decline"]
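Above are maps of the admitted and declined rates by state. Each rate is the share of all admitted (or declined) students who come from that state, so it reflects how many students each state contributes as well as how those students fared. Some findings from the maps are: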
Florida had the highest rate of admitted students.
Utah had the lowest rate of admitted students.
Florida also has the highest rate of rejected students.
California, Oregon, and Mississippi all have the lowest rate of rejected students.
There isn’t a clear relationship between admissions and rejections by state, which suggests that a student’s home state is not pivotal to the internship decision, although the small per-state counts limit how far this can be generalized.
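The rate computed above divides each state's count by the total for that decision. A complementary view, sketched below with the counts from the state-by-decision table, divides by each state's own total instead. The column names `ShareOfDecision` and `ShareOfState` are hypothetical labels introduced for this illustration.

```python
import pandas as pd

# Counts copied from the state-by-decision table in this report.
rows = [
    ("Admit", "California", 9), ("Admit", "Colorado", 8),
    ("Admit", "Florida", 11), ("Admit", "Utah", 1),
    ("Decline", "California", 1), ("Decline", "Colorado", 6),
    ("Decline", "Florida", 13), ("Decline", "Mississippi", 1),
    ("Decline", "Oregon", 1), ("Decline", "Utah", 2),
    ("Decline", "Virginia", 4),
    ("Waitlist", "Alabama", 1), ("Waitlist", "California", 2),
    ("Waitlist", "Colorado", 4), ("Waitlist", "Florida", 11),
    ("Waitlist", "New York", 1), ("Waitlist", "Utah", 3),
    ("Waitlist", "Vermont", 1),
]
counts = pd.DataFrame(rows, columns=["Decision", "State", "Count"])

# Share of each decision group that comes from a state (the rate mapped above).
decision_totals = counts.groupby("Decision")["Count"].transform("sum")
counts["ShareOfDecision"] = counts["Count"] / decision_totals * 100

# Share of each state's students that received a given decision.
state_totals = counts.groupby("State")["Count"].transform("sum")
counts["ShareOfState"] = counts["Count"] / state_totals * 100

print(counts[counts["Decision"] == "Admit"].round(1))
```

With these counts, Florida supplies the largest share of admitted students (11 of 29), yet only 11 of its 35 students were admitted, while California had 9 of its 12 students admitted. The two denominators answer different questions, so it helps to state which one a map uses.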
Gender
It is important to establish that internship opportunities are given fairly and equitably to all students regardless of gender. Analyzing decisions by gender can highlight any discrepancies or biases in the selection process, which is the primary focus of the following section.
Code
%%html
<figure>
  <img src="../website/images/heatmap.png" style="display: block; margin-left: auto; margin-right: auto;">
  <figcaption style="text-align: center;">Figure 1: Heatmap of student internship decisions by gender to highlight any biases that may appear.</figcaption>
</figure>
Figure 1: Heatmap of student internship decisions by gender to highlight any biases that may appear.
The heatmap above displays internship decision counts for female and male students. Because the colors represent counts in very similar ranges, gender does not appear to be a contributing factor in internship decisions. However, it is important to back that statement with a statistical test such as the Chi-Square test. Because the data set is relatively small, the Chi-Square approximation may not be accurate, so Fisher’s Exact Test was also performed to verify the Chi-Square result.
The Chi-Square statistic measures the difference between the observed gender frequencies in each decision category and the frequencies that would be expected if there were no association between the variables. A low Chi-Square value, as seen above, indicates that the observed frequencies are very close to the expected frequencies. The large p-value (greater than 0.05) means the test finds no significant association between gender and admission decisions, so there is no evidence of bias based on this test, which is the desired result.
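To make the comparison of observed and expected frequencies concrete, here is a minimal sketch of the Chi-Square calculation written out by hand. The gender-by-decision counts are hypothetical and are not the report's contingency table. The closed-form p-value used here holds only for 2 degrees of freedom, which is what a 2 by 3 table has.

```python
import math

# Hypothetical gender-by-decision counts, for illustration only.
observed = {
    "Female": {"Admit": 13, "Waitlist": 11, "Decline": 12},
    "Male": {"Admit": 16, "Waitlist": 12, "Decline": 16},
}

decisions = ["Admit", "Waitlist", "Decline"]
row_totals = {g: sum(cells.values()) for g, cells in observed.items()}
col_totals = {d: sum(observed[g][d] for g in observed) for d in decisions}
grand_total = sum(row_totals.values())

# Expected count under independence: row total * column total / grand total.
expected = {
    g: {d: row_totals[g] * col_totals[d] / grand_total for d in decisions}
    for g in observed
}

# Chi-Square statistic: sum of (observed - expected)^2 / expected over all cells.
chi2 = sum(
    (observed[g][d] - expected[g][d]) ** 2 / expected[g][d]
    for g in observed
    for d in decisions
)

# Degrees of freedom = (rows - 1) * (columns - 1) = 2 for this table.
dof = (len(observed) - 1) * (len(decisions) - 1)

# For exactly 2 degrees of freedom the p-value is exp(-chi2 / 2).
p_value = math.exp(-chi2 / 2)

print(f"chi2 = {chi2:.4f}, dof = {dof}, p = {p_value:.4f}")
```

In practice a library routine such as `scipy.stats.chi2_contingency` handles any table shape; the hand calculation is only meant to show where the statistic comes from.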
Code
from scipy.stats import fisher_exact

# Fisher's Exact Test needs a 2x2 table, so keep only the Admit and Decline columns
cont_table_small = cont_table[["Admit", "Decline"]]
odds_ratio, p_value_fish = fisher_exact(cont_table_small)
| Statistic - Fisher’s Exact Test | Value |
| --- | --- |
| Fisher’s Odds Ratio | 1.0833 |
| Fisher’s Test P-Value | 1.0 |
For Fisher’s Exact Test, the odds ratio compares the odds of an event occurring in one group with the odds in another. An odds ratio greater than 1 indicates a positive association. The odds ratio of 1.083 means that the odds of being admitted are 8.3% higher for one group than for the other. Although the odds ratio differs slightly from 1, the p-value for the test is 1.0, which is above any conventional significance level, so the difference is not statistically significant. This agrees with the Chi-Square test, indicating no evidence of an association between admissions and gender.
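The odds ratio and exact p-value can be illustrated with a short hand calculation. The 2x2 counts below are hypothetical, chosen only so that the odds ratio matches the reported 1.0833; they are not taken from the data set. The two-sided p-value sums the hypergeometric probabilities of every table that is no more likely than the one observed.

```python
from math import comb

# Hypothetical counts (rows: two gender groups; columns: Admit, Decline).
table = [[13, 12],
         [16, 16]]
(a, b), (c, d) = table

# Sample odds ratio: odds of admission in row 1 divided by odds in row 2.
odds_ratio = (a * d) / (b * c)

def table_probability(x, row1, row2, col1):
    """Hypergeometric probability of x in the top-left cell with fixed margins."""
    return comb(row1, x) * comb(row2, col1 - x) / comb(row1 + row2, col1)

row1, row2, col1 = a + b, c + d, a + c
lowest = max(0, col1 - row2)   # smallest feasible top-left count
highest = min(row1, col1)      # largest feasible top-left count

p_observed = table_probability(a, row1, row2, col1)
all_probs = [table_probability(x, row1, row2, col1)
             for x in range(lowest, highest + 1)]

# Two-sided p-value: total probability of tables no more likely than observed
# (a small tolerance guards against floating-point ties).
p_value = sum(p for p in all_probs if p <= p_observed * (1 + 1e-9))

print(f"odds ratio = {odds_ratio:.4f}")
print(f"odds are {100 * (odds_ratio - 1):.1f}% higher in the first group")
print(f"two-sided p-value = {p_value:.4f}")
```

Here the observed table is the single most likely one given its margins, so every possible table counts toward the p-value and it reaches 1. That is how an odds ratio slightly above 1 can coexist with no statistical evidence of a difference.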
Academic & Holistic Insights
Through the use of pairplots and machine learning techniques, relationships between students’ academic features (GPA, writing score, test score) and the internship application outcome can be understood. This information can help students at Data Tech understand which features of their application may contribute to internship decisions, and how strongly. Furthermore, this analysis can provide insight into areas of the curriculum that may or may not need targeted attention to improve students’ internship admission chances.
Above is a pairplot of GPA, writing score, and test score of the students grouped by the decision. When looking at the scatterplots, we notice some patterns:
Students with a low test score, regardless of GPA, were declined.
Students with a high test score and a high GPA were admitted.
Students with a fairly high GPA but an average test score were waitlisted.
Students with a high test score, regardless of writing score, were admitted.
Students with a low test score, regardless of writing score, were declined.
Students with a high writing score but an average test score were waitlisted.
The pairplot makes it apparent that the academic features are related to the decision outcome, but some features seem to be more important than others.
To understand the factors influencing college student internship decisions, we employed a tree-based machine learning model, specifically XGBoost, to analyze the data. To interpret the model’s predictions and assess the impact of each factor, we utilized Shapley values, a concept from cooperative game theory. This analysis enables us to identify which factors most strongly influence internship decisions. Gaining insights into these relationships will help us pinpoint areas for improvement in the college curriculum, ensuring that students are well-prepared and have the highest likelihood of securing summer internships.
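The game-theory idea behind Shapley values can be shown on a tiny example. The sketch below computes exact Shapley values for three features by averaging each feature's marginal contribution over every ordering. The coalition payoffs are invented for illustration and do not come from the XGBoost model; libraries such as `shap` use faster tree-specific algorithms rather than this brute-force enumeration.

```python
from itertools import permutations

features = ["TestScore", "GPA", "WritingScore"]

# Hypothetical payoff of each feature coalition, e.g. a model score when
# only those features are known. These numbers are invented.
coalition_value = {
    frozenset(): 0.0,
    frozenset({"TestScore"}): 0.5,
    frozenset({"GPA"}): 0.3,
    frozenset({"WritingScore"}): 0.2,
    frozenset({"TestScore", "GPA"}): 0.7,
    frozenset({"TestScore", "WritingScore"}): 0.6,
    frozenset({"GPA", "WritingScore"}): 0.4,
    frozenset({"TestScore", "GPA", "WritingScore"}): 0.8,
}

def shapley_values(players, value):
    """Average each player's marginal contribution over all join orders."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        seen = frozenset()
        for player in order:
            with_player = seen | {player}
            totals[player] += value[with_player] - value[seen]
            seen = with_player
    return {p: total / len(orderings) for p, total in totals.items()}

phi = shapley_values(features, coalition_value)
for name, contribution in phi.items():
    print(f"{name}: {contribution:.4f}")
```

The contributions always add up to the payoff of the full coalition minus the payoff of the empty one, which is the property that lets SHAP split a single prediction into per-feature parts.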
Code
%%html
<figure>
  <img src="../website/images/shap.png" style="display: block; margin-left: auto; margin-right: auto;">
  <figcaption style="text-align: center;">Figure 1: Visualization of SHAP values indicating the overall impact of various features on internship decisions.</figcaption>
</figure>
Figure 1: Visualization of SHAP values indicating the overall impact of various features on internship decisions.
The figure above quantifies the overall influence of students’ academic and holistic attributes on internship application outcomes. Higher SHAP values mean those features have a greater impact on internship decisions. Conversely, lower SHAP values indicate factors that are less important in the internship decision-making process.
As we can see, test scores, GPA, and writing scores are among the top contributors, while features such as work experience and volunteer level are not weighed as heavily. This suggests that companies focus mainly on a student’s academic background, rather than their holistic attributes, when making their decision.
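One common way to turn per-student SHAP values into an overall ranking like the one described above is to average their absolute values for each feature. Whether the figure uses exactly this summary is an assumption, and the SHAP values below are invented for illustration.

```python
# Hypothetical SHAP values: one dict per student, one entry per feature.
shap_rows = [
    {"TestScore": 0.40, "GPA": -0.20, "WritingScore": 0.10, "WorkExp": 0.02},
    {"TestScore": -0.35, "GPA": 0.15, "WritingScore": -0.12, "WorkExp": -0.01},
    {"TestScore": 0.30, "GPA": 0.25, "WritingScore": 0.05, "WorkExp": 0.03},
]

feature_names = list(shap_rows[0])

# Mean absolute SHAP value: the sign is dropped because a strong push toward
# either outcome counts as influence.
importance = {
    name: sum(abs(row[name]) for row in shap_rows) / len(shap_rows)
    for name in feature_names
}

# Rank features from most to least influential.
ranking = sorted(importance, key=importance.get, reverse=True)
for name in ranking:
    print(f"{name}: {importance[name]:.3f}")
```

Taking absolute values before averaging matters: a feature that pushes some students strongly toward admission and others strongly away would otherwise average out near zero and look unimportant.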
Code
%%html
<figure>
  <img src="../website/images/spec_shap.png" style="display: block; margin-left: auto; margin-right: auto;">
  <figcaption style="text-align: center;">Figure 2: Visualization of SHAP values indicating the impact of various features on internship application outcomes by decision.</figcaption>
</figure>
Figure 2: Visualization of SHAP values indicating the impact of various features on internship application outcomes by decision.
While the previous plot displayed how student attributes play a role in the internship decision-making process overall, Shapley values also make it possible to analyze how each feature contributes to each possible decision (Admit, Waitlist, Decline).
The figure above displays just that. It appears that the test score is the most significant factor contributing to a student’s likelihood of being admitted. It falls in line with the earlier pairplot as we saw significant overlap in internship decisions among GPAs and writing scores, but distinct separation for test scores. This suggests that students with higher test scores have a greater advantage in the competitive internship landscape.
For students who are placed on a waitlist, both test scores and GPA are important considerations. This might indicate that students on the waitlist have comparatively lower test scores than those who are admitted outright. The pairplot shows the same pattern, so for these students academic performance appears to be a deciding factor that could tip the balance in their favor for internship admission.
On the other hand, for students who are declined, the test score still holds considerable weight, but the writing score becomes notably more influential. This pattern could imply that declined students, while possibly having adequate test scores, may fall short in demonstrating the necessary writing proficiency, which is critical for many internships that require strong communication skills.
Conclusions
These insights suggest three strategic focuses for the institution:
Test Score Improvement: Continue to prioritize and enhance test preparation services, ensuring that the students can achieve the highest scores possible.
Academic Support: Given the importance of GPA, particularly for waitlisted students, bolstering academic support can help these students improve their standing and increase their chances of moving from waitlist to admit.
Writing Proficiency: To address the writing skills that affect both waitlisted and declined students, consider expanding the writing centers and integrating more communication-focused workshops into student services.
By concentrating on these areas, the college will help its students to not only meet but exceed the expectations of internship programs, thereby improving their chances of being admitted.